In this notebook, we test correlations between the dependent and explanatory variables, an important step in choosing variables for our regression analysis. The goal is therefore to find combinations of dependent and explanatory variables with high correlations. We do this in two main ways: with a correlation matrix and with a scatter matrix.
# import modules, change cwd
import os
import pandas as pd
import geopandas as gpd
import numpy as np
import matplotlib.pyplot as plt
import feather
import plotly.express as px
import psutil
import seaborn as sn
%matplotlib inline
Here we prepare a dataframe that holds the dependent and explanatory variables. The dependent variables come from the LMA dataset, and the explanatory variables come from various datasets (such as CBS, BGT, UrbanAtlas). The main challenge is to find explanatory variables that have a relatively high correlation with our dependent variables; we can identify candidates through a literature review.
Relevant columns (see detailed explanations on CBS website):
# import CBS pc4 dataset
cbs = gpd.read_file('variables/CBS/CBS-PC4-2018-v3/CBS_PC4_2018_v3.shp')
# select and rename columns
cbs = cbs[['PC4', 'INWONER', 'AANTAL_HH', 'GEM_HH_GR', 'WONING', 'WOZWONING', 'M_INKHH','P_LINK_HH', 'P_HINK_HH',
'AFS_OPRIT', 'AFS_TRNOVS', 'AFS_TREINS',
'AFS_ONDHV', 'AV10_ONDHV', 'AFS_ONDVMB', 'AV10ONDVMB', 'OAD', 'STED']]
cbs.rename(columns={
'PC4': 'pc4',
'INWONER': 'pop',
'AANTAL_HH': 'numHh',
'GEM_HH_GR': 'avHhSize',
'WONING': 'numDwellings',
'WOZWONING': 'avHhPrice',
'M_INKHH': 'medHhIncome',
'P_LINK_HH': 'perLowIncome',
'P_HINK_HH': 'perHighIncome',
'AFS_OPRIT': 'kmMainRoad',
'AFS_TRNOVS': 'kmMainTrain',
'AFS_TREINS': 'kmTrain',
'AFS_ONDHV': 'kmHavo',
'AV10_ONDHV': '10kmHavo',
'AFS_ONDVMB': 'kmVmbo',
'AV10ONDVMB': '10kmVmbo',
'OAD': 'addressesPerKm2',
'STED': 'urbanCode'
}, inplace=True)
# replace sentinel values like -99999 with np.nan
def to_nan(x):
    if x < 0:
        return np.nan
    else:
        return x
columns = ['pop', 'numHh', 'avHhSize', 'numDwellings', 'avHhPrice',
'perLowIncome', 'perHighIncome', 'kmMainRoad', 'kmMainTrain', 'kmTrain',
'kmHavo', '10kmHavo', 'kmVmbo', '10kmVmbo', 'addressesPerKm2',
'urbanCode']
for column in columns:
    cbs[column] = cbs[column].map(to_nan)
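The element-wise `map` above works, but pandas can do the same replacement in one vectorized call with `DataFrame.mask`. A minimal sketch on a toy frame (column names and values are illustrative, not from the CBS data):

```python
import numpy as np
import pandas as pd

# toy frame mimicking two CBS columns; -99997 is a "no data" sentinel
toy = pd.DataFrame({'pop': [120, -99997, 300], 'avHhSize': [2.1, 1.8, -99997]})

# replace every negative sentinel with NaN in a single vectorized step
toy = toy.mask(toy < 0)

print(toy['pop'].isna().sum())  # 1 sentinel replaced in 'pop'
```

This avoids the Python-level loop entirely and applies to all selected columns at once.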
Here, we create a dataframe that contains, for each pc4 postcode: the total kg received, its log, and the CBS variables.
# create df with columns: pc4, log(kg), cbs variables for each pc4
# read afgifte dataframe
df = feather.read_dataframe('lma/af_2019-2020_matched-codes.feather')
df = df[~df.gnc.isna()] # extract rows with gnc
df.eaPostcode = df.eaPostcode.str[:4] # convert to pc4
df = df[df.year == 2019] # only looking at numbers in 2019
df.eaPostcode = df.eaPostcode.astype('int64')
df = df[['eaPostcode', 'gnc', 'kg']]
df = df.groupby(['eaPostcode', 'gnc']).sum().kg.reset_index()
df.rename(columns={'eaPostcode': 'postcode', 'kg': 'kg_received'}, inplace=True)
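The groupby-sum step above collapses the individual records into one total per (postcode, waste-code) pair. A minimal illustration on toy data (values are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    'eaPostcode': [1011, 1011, 1012],
    'gnc': ['25.1', '25.1', '25.2'],
    'kg': [100.0, 50.0, 30.0],
})

# sum kg over each (postcode, waste-code) pair, then flatten the index back
agg = toy.groupby(['eaPostcode', 'gnc']).kg.sum().reset_index()
print(agg)  # the two 1011/'25.1' rows collapse into one row with kg == 150.0
```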
# create column with log kg values
def log(x):
    if x > 0:
        return np.log(x)
    else:
        return 0

df['log_kg'] = df.kg_received.map(log)
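The piecewise `log` function above can also be written as a single vectorized expression, which is faster on long columns. A sketch on a toy series (the intermediate `where(..., 1.0)` only exists to avoid a log-of-zero warning; those positions are overwritten by the outer `np.where`):

```python
import numpy as np
import pandas as pd

kg = pd.Series([0.0, 1.0, 1000.0])

# vectorized equivalent of the piecewise function: log(x) where x > 0, else 0
log_kg = pd.Series(np.where(kg > 0, np.log(kg.where(kg > 0, 1.0)), 0.0))
print(log_kg.tolist())  # [0.0, 0.0, ~6.908]
```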
print('afgifte rows, columns: {}'.format(df.shape))
print('cbs rows, columns: {}'.format(cbs.shape))
# merge with cbs data (which already has geometry)
df = pd.merge(df, cbs, how='right', left_on='postcode', right_on='pc4')
df.drop(labels=['postcode'], axis=1, inplace=True)
df['kg_received'] = df['kg_received'].fillna(0)
df['log_kg'] = df['log_kg'].fillna(0)
df = df[['pc4','kg_received', 'log_kg','gnc', 'pop', 'numHh', 'avHhSize', 'avHhPrice',
'numDwellings', 'medHhIncome', 'perLowIncome', 'perHighIncome',
'kmMainRoad', 'kmMainTrain', 'kmTrain', 'kmHavo', '10kmHavo', 'kmVmbo',
         '10kmVmbo', 'addressesPerKm2', 'urbanCode']] # rearrange columns with pc4 in front
# display shape
print('merged (cbs+lma) rows, columns: {}'.format(df.shape))
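The `how='right'` merge above keeps every CBS postcode even when it received no waste, which is why `kg_received` and `log_kg` are filled with 0 afterwards. A minimal illustration of that semantics on toy frames:

```python
import pandas as pd

lma = pd.DataFrame({'postcode': [1011], 'kg_received': [150.0]})
cbs = pd.DataFrame({'pc4': [1011, 1012], 'pop': [5000, 3000]})

# right merge keeps all CBS postcodes; unmatched ones get NaN for kg_received
merged = pd.merge(lma, cbs, how='right', left_on='postcode', right_on='pc4')
merged['kg_received'] = merged['kg_received'].fillna(0)

print(len(merged))  # 2: postcode 1012 survives with kg_received == 0
```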
# add geometry of postcodes to df
nl_pc4 = gpd.read_file('spatial-data/nl_pc4.shp')
nl_pc4 = nl_pc4[['PC4', 'geometry']]
nl_pc4.PC4 = nl_pc4.PC4.astype('int64')
print('nl_pc4 rows, columns: {}'.format(nl_pc4.shape))
df = pd.merge(df, nl_pc4, how='left', left_on='pc4', right_on='PC4')
# get centroids, x, y coordinates
df = gpd.GeoDataFrame(df)
df = df.to_crs('EPSG:28992')
df['centroids'] = df.geometry.centroid
df['x'] = df.centroids.x
df['y'] = df.centroids.y
print('merged (cbs+lma+nl_pc4) rows, columns: {}'.format(df.shape))
df.head(2)
The correlation matrix below shows correlations between all variables we've extracted. The lighter the color, the higher the correlation. The first two rows show the correlations between the dependent variables (kg_received and log_kg) and explanatory variables extracted from the cbs dataset.
df.gnc = df.gnc.fillna('-')
df25 = df[df.gnc.str.startswith('25')]
dfCorr = df25.drop(labels=['pc4', 'gnc', 'medHhIncome', 'PC4', 'geometry', 'centroids', 'x', 'y'], axis=1)
sn.heatmap(dfCorr.corr(), square=True, annot=False)
plt.show()
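Besides reading the heatmap visually, the correlations of a dependent variable with every explanatory variable can be pulled out and ranked directly from the correlation matrix. A sketch on synthetic data (the real notebook would call this on `dfCorr`; the toy columns here are constructed so that `pop` is correlated with `log_kg` and `noise` is not):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
pop = rng.uniform(100, 10000, n)
toy = pd.DataFrame({
    'log_kg': np.log(pop) + rng.normal(0, 0.1, n),  # correlated with pop by construction
    'pop': pop,
    'noise': rng.normal(size=n),                    # uncorrelated control column
})

# rank explanatory variables by absolute correlation with the dependent variable
ranked = toy.corr()['log_kg'].drop('log_kg').abs().sort_values(ascending=False)
print(ranked.index[0])  # 'pop' should rank first
```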
The first two rows of the scatter matrix below show the relationships between kg_received and log(kg) (the dependent variables) and all the explanatory variables (such as distance to the nearest main road, the number of HAVO schools within a 10 km radius, etc.). As we can see, the scatterplots on the first row (kg_received) all have a similar shape, a downward curve. This probably has more to do with the log-distribution of the kg values than with any actual trend. The scatterplots on the second row (log(kg)) don't seem to follow any pattern; they are all blobs. This suggests that there are no strong relationships between total kg_received and any of the cbs variables. There are two ways to proceed here: one is to study kg_received of a specific material instead of all materials combined; another is to find other explanatory variables that might have a stronger relationship with kg_received.
sn.pairplot(dfCorr)
plt.show()
# save as csv for mgwr tool
df_0 = df[df.log_kg > 0]
df_0.to_csv('lma/af_2019_no0_for-GWR.csv')
df.to_csv('lma/af_2019_for-GWR.csv')
Before doing a regression analysis, we need to choose the right explanatory variables. In other words, we need to choose variables that are likely to have a significant effect on our dependent variable, log(kg). We can choose variables by reading existing literature (we've made a start with our previous paper), but we can also use tools. One of these is the Exploratory Regression tool in ArcGIS. This tool finds the combination of explanatory variables that best explains the dependent variable: it first finds the best single variable, then the best combination of two variables, then three, and so on, until a passing model is found. Each candidate model is evaluated using a number of statistical tests: AICc, VIF, JB, and R2.
Explanation of the statistical tests: AICc (corrected Akaike Information Criterion) measures model fit while penalizing complexity; VIF (Variance Inflation Factor) flags multicollinearity among the explanatory variables; JB (Jarque-Bera) tests whether the residuals are normally distributed; R2 (adjusted R-squared) measures the share of variance explained.
I haven't done this step yet.
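The core idea of exploratory regression (try every combination of explanatory variables, score each OLS fit with AICc, keep the best) can be prototyped with plain numpy. A minimal sketch on synthetic data, not the ArcGIS implementation: column names are illustrative, and only the AICc criterion is used here (no VIF/JB checks):

```python
from itertools import combinations

import numpy as np

def aicc(y, X):
    """Fit OLS by least squares and return the corrected AIC (lower is better)."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    aic = n * np.log(rss / n) + 2 * k
    return aic + 2 * k * (k + 1) / (n - k - 1)  # small-sample correction term

rng = np.random.default_rng(1)
n = 300
data = {'pop': rng.normal(size=n), 'kmTrain': rng.normal(size=n), 'noise': rng.normal(size=n)}
y = 2.0 * data['pop'] - 1.0 * data['kmTrain'] + rng.normal(scale=0.5, size=n)

# exhaustively score every subset of explanatory variables, keep the lowest AICc
best = None
for k in (1, 2, 3):
    for combo in combinations(data, k):
        X = np.column_stack([np.ones(n)] + [data[c] for c in combo])
        score = aicc(y, X)
        if best is None or score < best[0]:
            best = (score, combo)

print(best[1])  # the winning combination should contain 'pop' and 'kmTrain'
```

This brute-force search is only feasible for a handful of candidate variables; the ArcGIS tool additionally prunes models that fail its multicollinearity and residual tests.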